An important issue in releasing individual data is to protect sensitive information from
being leaked and maliciously utilized. Well-known privacy-preserving principles that aim to ensure
both data privacy and data integrity, such as k-anonymity and l-diversity, have been extensively
studied both theoretically and empirically. Nonetheless, these widely adopted principles are still
insufficient to prevent attribute disclosure if the attacker has partial knowledge about the overall
sensitive data distribution. The t-closeness principle has been proposed to fix this, and it also has
the benefit of supporting numerical sensitive attributes. However, in contrast to k-anonymity
and l-diversity, the theoretical aspects of t-closeness have not been well investigated.

We initiate the first systematic theoretical study of the t-closeness principle under the
commonly used attribute suppression model. We prove that for every constant t such that
$0 \le t < 1$, it is NP-hard to find an optimal t-closeness generalization of a given table. The proof
consists of several reductions, each of which works for different values of t, and which together cover
the full range. To complement this negative result, we also provide exact and fixed-parameter
algorithms. Finally, we answer some open questions regarding the complexity of k-anonymity
and l-diversity left in the literature.
1 Introduction



Privacy-preserving data publication is an important and active topic in the database area. Nowa- 
days many organizations need to publish microdata that contain certain information, e.g., medical 
condition, salary, or census data, of a collection of individuals, which are very useful for research 
and other purposes. Such microdata are usually released as a table, in which each record (i.e., row) 
corresponds to a particular individual and each column represents an attribute of the individuals. 
The released data usually contain sensitive attributes, such as Disease and Salary, which, once 
leaked to unauthorized parties, could be maliciously utilized and harm the individuals. Therefore, 
those features that can directly identify individuals, e.g., Name and Social Security Number, should 
be removed from the released table. See Table 1 for an example of an (imagined) microdata table that
a hospital prepares to release for medical research. (Note that the IDs in the first column are only 
for simplicity of reference, but not part of the table.) 
Table 1: The raw microdata table.

Table 2: A 3-anonymous partition.



Nonetheless, even with unique identifiers removed from the table, sensitive personal information
can still be disclosed through linking attacks [27], [28], which try to identify individuals from the
combination of quasi-identifiers. The quasi-identifiers are those attributes that can reveal partial
information about the individual, such as Gender, Age, and Hometown. For instance, consider an
adversary who knows that one of the records in Table 1 corresponds to Bob. In addition, he knows
that Bob is around thirty years old and has a Master's Degree. Then he can easily identify the
third record as Bob's and thus learn that Bob has heart disease.

A widely-adopted approach for protecting privacy against such attacks is generalization, which 
partitions the records into disjoint groups and then transforms the quasi-identifier values in each 
group to the same form. (The sensitive attribute values are not generalized because they are usually 
the most important data for research.) Such generalization needs to satisfy some anonymization 
principles, which are designed to guarantee data privacy to a certain extent. 

The earliest (and probably most famous) anonymization principle is the k-anonymity principle
proposed by Samarati [27] and Sweeney [28], which requires each group in the partition to have
size at least k for some pre-specified value of k; such a partition is called k-anonymous. Intuitively,
this principle ensures that every combination of quasi-identifier values appearing in the table is
indistinguishable from at least k - 1 other records, and hence protects the individuals from being
uniquely recognized by linking attacks. The k-anonymity principle has been extensively studied,
partly due to the simplicity of its statement. Table 2 is an example of a 3-anonymous partition of
Table 1, which applies the commonly used suppression method to generalize the values in the same
group, i.e., it replaces the conflicting values with a new symbol '*'.

Table 3: A 2-diverse partition of Table 1.

A potential issue with the k-anonymity principle is that it is totally independent of the sensitive
attribute values. This issue was formally raised by Machanavajjhala et al. [19], who showed that
k-anonymity is insufficient to prevent disclosure of sensitive values against the homogeneity attack.
For example, assume that an attacker knows that one record of Table 2 corresponds to Danny, who
is elderly and has a Doctorate Degree. From Table 2 he can easily conclude that Danny's record
must belong to the third group, and hence that Danny has cancer, since all people in the third
group have the same disease. To forestall such attacks, Machanavajjhala et al. [19] proposed the
l-diversity principle, which demands that at most a 1/l fraction of the records can have the same
sensitive value in each group; such a partition is called l-diverse. Table 3 is an example of a
2-diverse partition of Table 1. (There are some other formulations of l-diversity, e.g., one requiring
that each group comprises at least l different sensitive values.)

Li et al. [17] observed that the l-diversity principle is still insufficient to protect sensitive
information against the skewness attack, in which the attacker has partial knowledge of
the overall sensitive value distribution. Moreover, since l-diversity only cares whether two sensitive
values are distinct or not, it does not support sensitive attributes with semantic similarities well,
such as numerical attributes (e.g., salary).

To fix these drawbacks, Li et al. [17] introduced the t-closeness principle, which requires that
the sensitive value distribution in any group differs from the overall sensitive value distribution
by at most a threshold t. There is a metric space defined on the set of possible sensitive values,
in which the maximum distance between two points (i.e., sensitive values) is normalized to
1. The distance between two probability distributions of sensitive values is then measured by
the Earth-Mover Distance (EMD) [26], which is widely used in many areas of computer science.
Intuitively, the EMD measures the minimum amount of work needed to transform one probability
distribution to another by means of moving distribution mass between points in the probability
space. The EMD between two distributions in the (normalized) space is always between 0 and 1.
We will give an example of a t-closeness partition of Table 1 for some threshold t later in Section 2,
after the related notation and definitions are formally introduced.

The t-closeness principle has been widely acknowledged as an enhanced principle that fixes the
main drawbacks of previous approaches like k-anonymity and l-diversity. There are also many
other principles proposed to deal with different attacks or for use in ad hoc applications; see, e.g.,
[3j [201 E2 Ell EH ESH E2 El] and the references therein. 

1.1 Theoretical Models of Anonymization 

It is always assumed that the released table itself satisfies the considered principle (k-anonymity,
l-diversity, or t-closeness), since otherwise there exists no feasible solution at all. Therefore, the
trivial partition that puts all records in a single group always guarantees that the principle is met.
However, such a solution is useless in real-world scenarios, since it will most probably produce a
table full of '*'s, which is undesirable in most applications. This extreme example demonstrates
the importance of finding a balance between data privacy and data integrity.

Meyerson and Williams [21] proposed a framework for theoretically measuring the data integrity,
which aims to find a partition (under certain constraints such as k-anonymity) that minimizes the
number of suppressed cells (i.e., '*'s) in the table. This model has been widely adopted for
theoretical investigations of anonymization principles. Under this model, k-anonymity and l-diversity
have been extensively studied; more detailed literature reviews will be given later.

However, in contrast to k-anonymity and l-diversity, the theoretical aspects of the t-closeness
principle have not been well explored before. There are only a handful of algorithms designed for
achieving t-closeness [8], [17], [18], [25]. The algorithms given by Li et al. [17], [18] incorporate t-closeness
into k-anonymization frameworks (Incognito [14] and Mondrian [15]) to find a t-closeness partition.
Cao et al. [8] proposed the SABRE algorithm, which is the first framework tailored for t-closeness.
The information-theoretic approach in [25] works for an "average" version of t-closeness. None of
these algorithms is guaranteed to have good worst-case performance. Furthermore, to the best
of our knowledge, no computational complexity results for t-closeness have been reported in the
literature.

1.2 Our Contributions 

In this paper, we initiate the first systematic theoretical study of the t-closeness principle under
the commonly used suppression framework. First, we prove that for every constant t such that
$0 \le t < 1$, it is NP-hard to find an optimal t-closeness generalization of a given table. Notice
that the problem becomes trivial when t = 1, since the EMD between any two sensitive value
distributions is at most 1, and hence putting each record in a distinct group provides a feasible
solution that does not need to suppress any value at all, which is of course optimal. Our result shows
that the problem immediately becomes hard even if the threshold is relaxed to, say, 0.999. At the
other extreme, a 0-closeness partition demands that the sensitive value distribution in every group
be the same as the overall distribution. This seems to restrict the set of feasible solutions
very strongly, and thus one might wonder whether there exists an efficient algorithm for
this special case. Our result rules this out. The proof of our hardness
result actually consists of several different reductions. Interestingly, each of these reductions only
works for a set of special values of t, but altogether they cover the full range [0, 1). We note that
the hardness of $t_1$-closeness does not directly imply that of $t_2$-closeness for $t_1 \ne t_2$, since they may
have very different optimal objective values.

As a by-product of our proof, we establish the NP-hardness of k-anonymity when k = cn, where
n is the number of records and c is any constant in (0, 1/2]. To the best of our knowledge, this
is the first hardness result for k-anonymity that works for $k = \Omega(n)$. The existing approaches for
proving hardness of k-anonymity all fail to generalize to this range of k due to inherent limits of
the reductions. We note that k = n/2 is the largest possible value for which k-anonymity can be
hard, because when k > n/2, any k-anonymous partition can only contain one group, namely the
table itself.

To complement our negative results, we also provide exact and fixed-parameter algorithms
for obtaining the optimal t-closeness partition. Our exact algorithm for t-closeness runs in time
$2^{O(n)} \cdot O(m)$, where n and m are respectively the number of rows and columns in the input table.
Together with a reduction that we derive (Lemma 1), this gives a $2^{O(n)} \cdot O(m)$ time algorithm for
k-anonymity for all values of k, thus generalizing the result in [3] which only works for constant
k. We then prove that the problem is fixed-parameter tractable when parameterized by m and the
alphabet size of the input table. This implies that an optimal t-closeness partition can be found
in polynomial time if the number of quasi-identifiers and that of distinct attribute values are both
small (say, constants), which is true in many real-world applications. (We say a problem is
fixed-parameter tractable with respect to some parameters $k_1, \ldots, k_r$ if there is an algorithm solving the
problem that runs in time $h(k_1, \ldots, k_r) \cdot n^{O(1)}$, where n is the size of the input and h is an arbitrary
computable function depending only on the parameters. Parameterized complexity has become a
very active research area. For standard notation and definitions in parameterized complexity, we
refer the reader to [10].) We obtain our fixed-parameter algorithm by reducing t-closeness to a
special mixed integer linear program in which some variables are required to take integer values
while others are not. The mixed integer linear program we derive for characterizing t-closeness may be
of independent interest for future applications. We note that both of our algorithms work for all values of
t.

Last but not least, we review the problems of finding optimal k-anonymous and l-diverse
partitions, and answer two open questions left in the literature.

• We prove that the 2-diversity problem can be solved in polynomial time, which complements
the NP-hardness results for $l \ge 3$ given in [31]. (We notice that the authors of [9] claimed
that 2-diversity was proved to be polynomial by [31]. However, what [31] actually proved is
that the special 2-diversity instances in which there are only two distinct sensitive values
can be reduced to the matching problem and hence solved in polynomial time. They do not
have results for general 2-diversity. To the best of our knowledge, ours is the first work to
demonstrate the tractability of 2-diversity.)

• We then present an m-approximation algorithm for k-anonymity that runs in polynomial
time for all values of k. (Recall that m is the number of quasi-identifiers.) This improves the
$O(k)$ and $O(\log k)$ ratios in [1], [23] when k is relatively large compared to m. We note that
the performance guarantee of their algorithms cannot be reduced even for small values of m,
due to some intrinsic limitations (for example, [24] uses the tight $\Theta(\log k)$ approximation for
k-set cover).
1.3 Related Work



It is known that finding an optimal k-anonymous partition of a given table is NP-hard for every
fixed integer $k \ge 3$ [21], while it can be solved optimally in polynomial time when $k \le 2$ [1].
The NP-hardness result holds even for very restricted cases, e.g., when k = 3 and there are only
three quasi-identifiers [6]. On the other hand, Blocki and Williams [4] gave a $2^{O(n)} \cdot O(m)$ time
algorithm that finds an optimal k-anonymous partition when $k = O(1)$, where n and m are the
number of records and attributes (i.e., rows and columns) of the input table respectively. They
also showed this problem to be fixed-parameter tractable when m and $|\Sigma|$ (the alphabet size of the
table) are considered as parameters. The parameterized complexity of k-anonymity has also been
studied in [6], [7], [11] with respect to different parameters.

Meyerson and Williams [21] gave an $O(k \log k)$ approximation algorithm for k-anonymity, i.e.,
it finds a k-anonymous partition in which the number of suppressed cells is at most $O(k \log k)$ times
the optimum. The ratio was later improved to $O(k)$ by Aggarwal et al. [1] and to $O(\log k)$ by Park
and Shim [24]. We note that the algorithms in [21], [24] run in time $n^{O(k)}$, and hence are guaranteed
to be polynomial only if $k = O(1)$, while the algorithm in [1] has a truly polynomial running time
for all k. There are also a number of heuristic algorithms for k-anonymity (e.g., Incognito [14]),
which work well on many real datasets but have poor worst-case performance guarantees.

Xiao et al. [31] were the first to establish a systematic theoretical study of l-diversity. They
showed that finding an optimal l-diverse partition is NP-hard for every fixed integer $l \ge 3$ even if
m, the number of quasi-identifiers, is any fixed integer not smaller than l. They also provided an
$(l \cdot m)$-approximation algorithm. Dondi et al. [9] proved an inapproximability factor of $c \ln(l)$ for
l-diversity, where c > 0 is some constant, and showed that the problem remains APX-hard even
if l = 4 and the table consists of only three columns. They also presented an m-approximation
algorithm when the number of distinct sensitive values is constant, and gave some parameterized
hardness results and algorithms.

1.4 Paper Organization 

The rest of this paper is organized as follows. Section 2 introduces notation and definitions used
throughout the paper, and then formally defines the problems. Section 3 is devoted to proving the
hardness of finding the optimal t-closeness partition, while Section 4 provides exact and
parameterized algorithms. Sections 5 and 6 present our results for k-anonymity and 2-diversity, respectively.
Finally, the paper is concluded in Section 7 with some discussions and future research directions.

2 Preliminaries 



We consider a raw database that contains m quasi-identifiers (QIs) and a sensitive attribute (SA).¹
Each record t in the database is an (m + 1)-dimensional vector drawn from $\Sigma^{m+1}$, where $\Sigma$ is the
alphabet of possible values of the attributes. For $1 \le i \le m$, t[i] is the value of the i-th QI of t, and
t[m + 1] is the value of the SA of t. Let $\Sigma_s \subseteq \Sigma$ be the alphabet of possible SA values. A microdata
table (or table, for short) T is a multiset of vectors (or rows) chosen from $\Sigma^{m+1}$, and we denote by
|T| the size of T, i.e., the number of vectors contained in T. We will let n = |T| when the table T

¹ Following previous approaches, we only consider instances with one sensitive attribute. Our hardness result
indicates that one sensitive attribute already makes the problem NP-hard. Meanwhile, it is easy to verify that our
algorithms also work for the case where multiple sensitive attributes exist.
Table 4: The first three records in Table 1.



is clear from the context. Note that T may contain identical vectors since it is a multiset. We
also use T[j] to denote the j-th vector in T under some ordering, e.g., T[3][m + 1] is the SA value
of the third vector of T.

Let * be a fresh character not in $\Sigma$. For each vector $t \in T$, let $t^*$ be the suppressor of t (inside
T) defined as follows:

• $t^*[m+1] = t[m+1]$;

• for $1 \le i \le m$, $t^*[i] = t[i]$ if $t[i] = t'[i]$ for all $t' \in T$, and $t^*[i] = *$ otherwise.

The cost of a suppressor $t^*$ is $\mathrm{cost}(t^*) = |\{1 \le i \le m \mid t^*[i] = *\}|$, i.e., the number of '*'s in
$t^*$. It is easy to see that all vectors in T have the same suppressor if we only consider the
quasi-identifiers. The generalization of T is defined as $\mathrm{Gen}(T) = \{t^* \mid t \in T\}$. (Note that Gen(T) is
also a multiset.) The cost of the generalization of T is $\mathrm{cost}(T) = \sum_{t^* \in \mathrm{Gen}(T)} \mathrm{cost}(t^*)$, i.e., the sum
of the costs of all the suppressors. Since all suppressors in T have the same cost, we can equivalently
write $\mathrm{cost}(T) = |T| \cdot \mathrm{cost}(t^*)$ for any $t^* \in \mathrm{Gen}(T)$.

As an illustrative example, Table 4 consists of the first three records of Table 1, which contains
eight QIs (we regard each digit of Zip-code and Age as a separate QI) and one SA. The generalization
of Table 4 is also shown. In this case all suppressors have cost 5, and the cost of this generalization
is 5 · 3 = 15.

A partition $\mathcal{P}$ of table T is a collection of pairwise disjoint non-empty subsets of T whose union
equals T. Each subset in the partition is called a group or a sub-table. The cost of the partition
$\mathcal{P}$, denoted by $\mathrm{cost}(\mathcal{P})$, is the sum of the costs of all its groups. For example, the partition of Table 1
given by Table 2 has cost 5 · 3 + 6 · 4 + 5 · 3 = 54.
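
The definitions of suppressor, generalization, and partition cost can be illustrated with a minimal Python sketch. The function names and the toy rows below are hypothetical (they are not taken from Table 1); each row is assumed to be a tuple of m QI values followed by one SA value.

```python
from typing import List, Sequence, Tuple

Row = Tuple[str, ...]  # m quasi-identifier values followed by one SA value


def suppressed_columns(group: Sequence[Row]) -> List[bool]:
    """For each QI column, True if the group disagrees on it (so it becomes '*')."""
    m = len(group[0]) - 1  # the last position holds the sensitive attribute
    return [len({row[i] for row in group}) > 1 for i in range(m)]


def generalize(group: Sequence[Row]) -> List[Row]:
    """Gen(T): replace every conflicting QI value by '*', keep the SA value."""
    mask = suppressed_columns(group)
    m = len(mask)
    return [
        tuple("*" if mask[i] else row[i] for i in range(m)) + (row[m],)
        for row in group
    ]


def group_cost(group: Sequence[Row]) -> int:
    """cost(T) = |T| * (number of suppressed QI columns)."""
    return len(group) * sum(suppressed_columns(group))


def partition_cost(partition: Sequence[Sequence[Row]]) -> int:
    """cost of a partition = sum of the costs of its groups."""
    return sum(group_cost(g) for g in partition)


# Hypothetical toy data with three QIs and one SA.
g1 = [("a", "x", "1", "flu"), ("a", "y", "1", "cancer")]
g2 = [("b", "z", "2", "flu"), ("b", "z", "2", "flu"), ("c", "z", "3", "hiv")]

print(generalize(g1))             # one suppressed column per row
print(partition_cost([g1, g2]))   # 2*1 + 3*2
print(partition_cost([g1 + g2]))  # trivial partition: 5*3
```

On this toy input the two-group partition costs 8, while the trivial single-group partition costs 15, which mirrors the trade-off between privacy and data integrity discussed in Section 1.1.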

2.1 t-Closeness Principle 

We formally define the t-closeness principle introduced in [17] for protecting data privacy. Let
T be a table, and assume without loss of generality that $\Sigma_s = \{1, 2, \ldots, |\Sigma_s|\}$. The sensitive
attribute value space (SA space) is a normalized metric space $(\Sigma_s, d)$, where $d(\cdot,\cdot)$ is a distance
function defined on $\Sigma_s \times \Sigma_s$ satisfying that (1) $d(i,i) = 0$ for any $i \in \Sigma_s$; (2) $d(i,j) = d(j,i)$ for all
$i, j \in \Sigma_s$; (3) $d(i,j) + d(j,k) \ge d(i,k)$ for all $i, j, k \in \Sigma_s$ (this is called the triangle inequality); and
(4) $\max_{i,j \in \Sigma_s} d(i,j) = 1$ (this is called the normalized condition).

For a sub-table $M \subseteq T$ and $i \in \Sigma_s$, denote by n(M, i) the number of vectors in M whose SA value
equals i. Clearly $|M| = \sum_{i \in \Sigma_s} n(M, i)$. The sensitive attribute value distribution (SA
distribution) of M, denoted by P(M), is a $|\Sigma_s|$-dimensional vector whose i-th coordinate is
$P(M)[i] = n(M,i)/|M|$ for $1 \le i \le |\Sigma_s|$. Thus P(M) can be seen as the probability distribution of the SA
values in M, assuming that each vector in M appears with the same probability. For a threshold
$0 \le t \le 1$, we say M has t-closeness (with T) if $\mathrm{EMD}(P(M), P(T)) \le t$, where EMD(X, Y) is
the Earth-Mover Distance (EMD) between distributions X and Y [26]. A t-closeness partition of
T is one in which every group has t-closeness with T.

Intuitively, the EMD measures the minimum amount of work needed to transform one probability
distribution to another by means of moving distribution mass between points in the probability
space; here a unit of work corresponds to moving a unit amount of probability mass by a unit of
ground distance. The EMD between two SA distributions X and Y can be formally defined as the
optimal objective value of the following linear program [26], [17]:

$$\text{Minimize} \quad \sum_{i=1}^{|\Sigma_s|} \sum_{j=1}^{|\Sigma_s|} d(i,j)\, f(i,j)$$

subject to:

$$\sum_{j=1}^{|\Sigma_s|} f(i,j) = X[i], \qquad \forall\, 1 \le i \le |\Sigma_s|$$

$$\sum_{i=1}^{|\Sigma_s|} f(i,j) = Y[j], \qquad \forall\, 1 \le j \le |\Sigma_s|$$

$$f(i,j) \ge 0, \qquad \forall\, 1 \le i, j \le |\Sigma_s|.$$

The above constraints are a little different from those in [17]; however, they can be proved
equivalent using the triangle inequality condition of the SA space. It is also easy to see that
EMD(X, Y) = EMD(Y, X). By the normalized condition of the SA space, we have $0 \le \mathrm{EMD}(X, Y) \le 1$ for any
SA distributions X and Y.

The equal-distance space refers to a special SA space in which each pair of distinct sensitive
values have distance exactly 1. There is a concise formula for computing the EMD between two
SA distributions in this space.

Fact 1 (p2]). For any two SA distributions X and Y in the equal-distance space, we have

$$\mathrm{EMD}(X,Y) = \frac{1}{2} \sum_{i=1}^{|\Sigma_s|} |X[i] - Y[i]| = \sum_{1 \le i \le |\Sigma_s| :\, X[i] \ge Y[i]} (X[i] - Y[i]).$$

Therefore, in the equal-distance space, the EMD coincides with the total variation distance
between two distributions.

Let us go back to Table 1 for an example. We let 1, 2, and 3 denote the sensitive values "Viral
Infection", "Heart Disease", and "Cancer", respectively. Let the SA space be the equal-distance
space. The SA distribution of the whole table is then (0.3, 0.3, 0.4). Suppose we set the threshold
t = 0.3. It can be verified that Table 3, although a 2-diverse partition, is not a t-closeness
partition of Table 1. In fact, the SA distribution of the first group is (0.5, 0.5, 0), and hence the
EMD between it and the overall distribution is (0.2 + 0.2 + 0.4)/2 = 0.4. (This example also reflects a property of
the skewness attack that l-diversity suffers from. If an attacker can locate the record of Alice in the
first group of Table 3, then he knows that Alice does not have cancer. If he in addition knows that
Alice comes from some district where people have a very low chance of having heart disease, then he
would be confident that Alice has a viral infection.) We instead give a 0.3-closeness partition in
Table 5. We can actually verify that it is even a 0.1-closeness partition.

Table 5: A 0.3-closeness partition
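
The EMD computation in the equal-distance space (Fact 1) can be checked with a short Python sketch. The helper names are hypothetical, and because the tables themselves are not reproduced here, the record counts below are assumptions chosen only so that the resulting SA distributions match the ones quoted in the example, namely (0.3, 0.3, 0.4) for the whole table and (0.5, 0.5, 0) for the first group.

```python
from collections import Counter
from fractions import Fraction
from typing import List, Sequence


def sa_distribution(sa_values: Sequence[int], alphabet: Sequence[int]) -> List[Fraction]:
    """P(M): fraction of records in M carrying each sensitive value."""
    counts = Counter(sa_values)
    size = len(sa_values)
    return [Fraction(counts[s], size) for s in alphabet]


def emd_equal_distance(x: Sequence[Fraction], y: Sequence[Fraction]) -> Fraction:
    """Fact 1: in the equal-distance space, EMD = (1/2) * sum_i |X[i] - Y[i]|."""
    return sum(abs(a - b) for a, b in zip(x, y)) / 2


def has_t_closeness(group_sa, table_sa, alphabet, t) -> bool:
    """True if EMD(P(M), P(T)) <= t in the equal-distance space."""
    p_group = sa_distribution(group_sa, alphabet)
    p_table = sa_distribution(table_sa, alphabet)
    return emd_equal_distance(p_group, p_table) <= t


alphabet = [1, 2, 3]                   # 1, 2, 3 as in the running example
table_sa = [1] * 3 + [2] * 3 + [3] * 4  # assumed counts giving (0.3, 0.3, 0.4)
group_sa = [1, 2]                       # assumed group giving (0.5, 0.5, 0)

emd = emd_equal_distance(
    sa_distribution(group_sa, alphabet), sa_distribution(table_sa, alphabet)
)
print(emd)  # 2/5, i.e. 0.4
print(has_t_closeness(group_sa, table_sa, alphabet, Fraction(3, 10)))  # False
```

Exact rational arithmetic (`fractions.Fraction`) is used here so that the comparison against the threshold t is not affected by floating-point rounding.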
Now we are ready to define the main problem studied in this paper. 

Problem 1. Given an input table T, an SA space $(\Sigma_s, d)$, and a threshold $t \in [0, 1]$, the
t-Closeness problem asks for a t-closeness partition of T with minimum cost.

Finally, we review two other widely used privacy-preserving principles, namely k-anonymity
and l-diversity, and the combinatorial problems associated with them. A partition is called
k-anonymous if all its groups have size at least k. A (sub-)table M is called l-diverse if at most
|M|/l of the vectors in M have an identical SA value. A partition is called l-diverse if all its groups
are l-diverse.

Problem 2. Let T be a table given as input. The k-Anonymity (l-Diversity) problem asks
for a k-anonymous (l-diverse) partition of T with minimum cost.
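
As a small illustration of the two feasibility conditions, the following Python sketch checks whether a partition is k-anonymous or l-diverse. It assumes, purely for illustration, that each group is given as the list of SA values of its records (the QI values play no role in either check); the function names and the example partition are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence

Group = Sequence[Hashable]  # SA values of the records in one group


def is_k_anonymous(partition: Sequence[Group], k: int) -> bool:
    """Every group must contain at least k records."""
    return all(len(group) >= k for group in partition)


def is_l_diverse(partition: Sequence[Group], l: int) -> bool:
    """In every group M, at most |M|/l records may share the same SA value.

    The test max_count * l <= |M| avoids any division.
    """
    return all(
        max(Counter(group).values()) * l <= len(group) for group in partition
    )


# Hypothetical partition of ten records with SA values in {1, 2, 3}.
partition = [[1, 2], [1, 2, 3, 3], [1, 2, 3, 3]]

print(is_k_anonymous(partition, 2))  # True: all groups have size >= 2
print(is_k_anonymous(partition, 3))  # False: the first group has size 2
print(is_l_diverse(partition, 2))    # True: no value exceeds half of a group
print(is_l_diverse(partition, 3))    # False: value 3 fills half of a group
```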

3 NP-hardness Results 

In this section we study the complexity of the t-Closeness problem. The problem is trivial if
the given threshold is t = 1, since putting each vector in a distinct group produces a 1-closeness
partition with cost 0, which is obviously optimal. Our main theorem stated below indicates that
this is in fact the only easy case.

Theorem 1. For any constant t such that $0 \le t < 1$, t-Closeness is NP-hard.

We will prove Theorem 1 via several reductions, each covering a particular range of t, which
altogether prove the theorem. We first present a result that relates t-Closeness to k-Anonymity.

Lemma 1. There is a polynomial-time reduction from k-Anonymity to t-Closeness with
equal-distance space and t = 1 - k/n.

Proof. Let T be an input table of k-Anonymity. We change the SA values of vectors in T
to ensure that all their SA values are distinct; this can be done because the SA values are irrelevant
to the objective of the k-Anonymity problem. Assume w.l.o.g. that the SA values are {1, 2, ..., n}.
Consider an instance of t-Closeness with the same input table T, in which t = 1 - k/n and the
SA space is the equal-distance space. The SA distribution of T is (1/n, 1/n, ..., 1/n). In the
SA distribution of each size-r group $T_r$, there are exactly r coordinates equal to 1/r and n - r
coordinates equal to 0. By Fact 1, summing over the n - r coordinates on which P(T) exceeds $P(T_r)$ gives
$\mathrm{EMD}(P(T), P(T_r)) = (n - r)(1/n) = 1 - r/n$. Hence, a group of size r has t-closeness if and only if
$1 - r/n \le 1 - k/n$, i.e., if and only if it is of size at least k. Therefore, each k-anonymous partition
of T is also a t-closeness partition, and vice versa. The lemma follows. □

Figure 1: Reduction from MinBisection to (n/2)-Anonymity.
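
The key identity in the proof of Lemma 1, EMD(P(T), P(T_r)) = 1 - r/n when all SA values are distinct, can be sanity-checked numerically. The sketch below is only an illustration under the equal-distance space; the function name is hypothetical, and without loss of generality the group is assumed to occupy the first r coordinates.

```python
from fractions import Fraction


def emd_group_vs_table(n: int, r: int) -> Fraction:
    """EMD (equal-distance space) between the uniform distribution on n
    distinct SA values and the distribution of a group of r of those records."""
    table = [Fraction(1, n)] * n
    group = [Fraction(1, r)] * r + [Fraction(0)] * (n - r)
    # Fact 1: half of the L1 distance between the two distributions.
    return sum(abs(a - b) for a, b in zip(table, group)) / 2


n, k = 10, 4
t = 1 - Fraction(k, n)
for r in range(1, n + 1):
    emd = emd_group_vs_table(n, r)
    # The group meets the threshold t = 1 - k/n exactly when r >= k.
    print(r, emd, emd <= t)
```

For n = 10 and k = 4 the loop reports that the threshold is met precisely for the group sizes r = 4, ..., 10, which is the equivalence between k-anonymity and t-closeness used in the lemma.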

By Lemma 1 we can directly deduce the NP-hardness of t-Closeness when the threshold t is
given as input, using e.g. the NP-hardness of 3-Anonymity [21]. However, to show hardness for
a constant t that is bounded away from 1, we need $k/n = \Omega(1)$ and thus $k = \Omega(n)$. Unfortunately,
the existing hardness results for k-Anonymity only work for $k = O(1)$ and cannot be generalized
to large values of k. For example, most hardness proofs use reductions from the k-dimensional
matching problem, but this problem can be solved in polynomial time when $k = \Omega(n)$. Below we
show the NP-hardness of k-Anonymity for $k = \Omega(n)$ via reductions different from all previous
approaches in the literature.

Theorem 2. For any constant c such that $0 < c \le 1/2$, (cn)-Anonymity is NP-hard.

To the best of our knowledge, Theorem 2 is the first hardness result for k-Anonymity when
$k = \Omega(n)$. We note that the constant 1/2 is the best possible, since for any k > n/2, a k-anonymous
partition can only contain one group, namely the table itself. We first prove the following result,
which will be used as a starting point in further reductions.

Theorem 3. (n/2)-Anonymity is NP-hard.

Proof. We will present a polynomial-time reduction from the minimum graph bisection (MinBisection)
problem to (n/2)-Anonymity. MinBisection is a well-known NP-hard problem [12], [13] defined
as follows: given an undirected graph, find a partition of its vertices into two equal-sized halves so
as to minimize the number of edges with exactly one endpoint in each half.

Let G = (V, E) be an input graph of MinBisection, where $|V| \ge 4$ is even. Suppose
$V = \{v_1, v_2, \ldots, v_n\}$ and $E = \{e_1, e_2, \ldots, e_m\}$. In what follows we construct a table T of size n = |V|
that contains m = |E| quasi-identifiers. (The sensitive attributes are irrelevant in k-Anonymity,
so they will not appear.) This table will serve as the input to the k-Anonymity problem with
k = n/2. Intuitively, each row (or vector) of T corresponds to a vertex in V, while each column
(or QI) of T corresponds to an edge in E. The alphabet $\Sigma$ is $\{0, 1, 2, \ldots, n\}$. For each $i \in [n]$ and
$j \in [m]$,² let $T[i][j] = i$ if $v_i \in e_j$, and $T[i][j] = 0$ if $v_i \notin e_j$. Thus each column contains exactly
two non-zero elements corresponding to the two endpoints of the associated edge. See Figure 1 for
a toy example. It is easy to see that T can be constructed in polynomial time.

² We use [q] to denote the set {1, 2, ..., q}.
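
The construction of T from G can be written down in a few lines. The following Python sketch is a hedged illustration: the function name is hypothetical, vertices are assumed to be numbered 1, ..., n, and each edge is assumed to be given as a pair of distinct vertex numbers. The 4-cycle used as input is an arbitrary example, not the graph of Figure 1.

```python
from typing import List, Sequence, Tuple

Edge = Tuple[int, int]


def bisection_table(n: int, edges: Sequence[Edge]) -> List[List[int]]:
    """Build the n x m table T: T[i][j] = i if v_i is an endpoint of e_j, else 0.

    Rows correspond to vertices v_1..v_n and columns to edges e_1..e_m
    (the Python lists themselves are 0-indexed).
    """
    return [[i if i in edge else 0 for edge in edges] for i in range(1, n + 1)]


# Arbitrary example: the cycle v1 - v2 - v3 - v4 - v1.
cycle = [(1, 2), (2, 3), (3, 4), (4, 1)]
for row in bisection_table(4, cycle):
    print(row)
```

Each column of the resulting table has exactly two non-zero entries, namely the indices of the two endpoints of the corresponding edge.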

Before delving into the reduction, we first prove a result concerning the partition cost of T.
Any (n/2)-anonymous partition of T contains at most two groups. For the trivial partition that
only contains T itself, the cost is $n \cdot m$ because all elements in T must be suppressed. Thus
we may assume that an (n/2)-anonymous partition with minimum cost consists of exactly two groups. Suppose
$\mathcal{P} = \{T_1, T_2\}$ is an (n/2)-anonymous partition of T where $|T_1| = |T_2| = n/2$. Let $\{V_1, V_2\}$ be the
corresponding partition of V (recall that each vector in T corresponds to a vertex in V). Consider
$\mathrm{Gen}(T_1)$, the generalization of $T_1$. For any column $j \in [m]$, if some endpoint of $e_j$, say $v_i$, belongs
to $V_1$, then $T[i][j] = i$. By our construction of T, no other element in the j-th column
equals i. Since $|T_1| \ge n/2 \ge 2$, column j of $T_1$ must be suppressed to *. On the other hand, if
none of $e_j$'s endpoints belongs to $V_1$, then column j of $T_1$ contains only zeros and thus can stay
unsuppressed. Therefore, we obtain

$$\mathrm{cost}(T_1) = |T_1| \cdot (|E_{11}| + |E_{12}|) = n(|E_{11}| + |E_{12}|)/2, \qquad (1)$$

where $E_{pq}$ denotes the set of edges with one endpoint in $V_p$ and another in $V_q$, for $p, q \in \{1, 2\}$.
Similarly we have $\mathrm{cost}(T_2) = n(|E_{22}| + |E_{12}|)/2$. Hence the cost of the partition $\mathcal{P}$ is

$$\mathrm{cost}(\mathcal{P}) = \sum_{p=1}^{2} \mathrm{cost}(T_p) = n(|E| + |E_{12}|)/2, \qquad (2)$$

noting that $|E| = |E_{11}| + |E_{12}| + |E_{22}|$.

We now prove the correctness of the reduction. Let OPT be the minimum size of any cut $\{V_1, V_2\}$
of G with $|V_1| = |V_2|$, and OPT' be the minimum cost of any (n/2)-anonymous partition of T.
We prove that $\mathrm{OPT}' = n(|E| + \mathrm{OPT})/2$, which will complete the reduction from MinBisection
to (n/2)-Anonymity. Let $\{V_1, V_2\}$ be the cut of G achieving the optimal cut size OPT, where
$|V_1| = |V_2| = n/2$. Using the notation introduced before, we have $\mathrm{OPT} = |E_{12}|$. Let $\mathcal{P} = \{T_1, T_2\}$
where $T_p = \{T[i] \mid v_i \in V_p\}$ for $p \in \{1, 2\}$. Clearly $\mathcal{P}$ is an (n/2)-anonymous partition of T. By
Equation (2) we have $\mathrm{OPT}' \le \mathrm{cost}(\mathcal{P}) = n(|E| + \mathrm{OPT})/2$.

On the other hand, let $\mathcal{P}' = \{T'_1, T'_2\}$ be an (n/2)-anonymous partition with $\mathrm{cost}(\mathcal{P}') = \mathrm{OPT}'$.
We have $|T'_1| = |T'_2| = n/2$. Consider the partition $\{V'_1, V'_2\}$ of V with $V'_p = \{v_i \mid T[i] \in T'_p\}$ for
$p \in \{1, 2\}$. Since $|V'_1| = |V'_2| = n/2$, we have $\mathrm{OPT} \le |E'_{12}|$, where $E'_{12}$ denotes the set of edges
with one endpoint in $V'_1$ and another in $V'_2$. By Equation (2) we have $\mathrm{OPT}' = n(|E| + |E'_{12}|)/2 \ge
n(|E| + \mathrm{OPT})/2$. Combined with the previously obtained inequality $\mathrm{OPT}' \le n(|E| + \mathrm{OPT})/2$,
we have shown that $\mathrm{OPT}' = n(|E| + \mathrm{OPT})/2$. From the analysis we also know that an optimal
(n/2)-anonymous partition of T can easily be transformed into an optimal equal-sized cut of G.
This finishes the reduction from MinBisection to (n/2)-Anonymity, and completes the proof of
Theorem 3. □

Theorem 4. For any constant $c$ such that $0 < c \le 1/3$, $(cn)$-Anonymity is NP-hard.

Proof. Fix $0 < c \le 1/3$. We reduce $(n/2)$-Anonymity to $(cn)$-Anonymity, which will prove the NP-hardness of the latter due to Theorem 3. Let $T$ be an instance of $(n/2)$-Anonymity with $n$ rows and $m$ QI columns. Choose two fresh symbols $A_1, A_2$ not appearing in $T$. We construct a new table $T'$ with $n' = n/2c$ rows and $m' = m + nm$ QI columns as follows.‡ For all $1 \le i \le n$, let $T'[i][j] = T[i][j]$ for $1 \le j \le m$, and $T'[i][j] = A_1$ for $m + 1 \le j \le m'$. For all $n + 1 \le i \le n'$ and $1 \le j \le m'$, let $T'[i][j] = A_2$. This finishes the description of $T'$. See Figure 2 for an example where $c = 1/3$, $A_1$ = 'A', and $A_2$ = 'B'. Clearly $T'$ can be constructed in polynomial time.

[Figure 2: Reduction from $(n/2)$-Anonymity to $(n/3)$-Anonymity. In the example, $T'$ has 5 + 4*5 = 25 columns.]

Let $\mathrm{OPT}$ denote the minimum cost of an $(n/2)$-anonymous partition of $T$ and $\mathrm{OPT}'$ be the minimum cost of a $(cn')$-anonymous partition of $T'$. We next prove $\mathrm{OPT} = \mathrm{OPT}'$, which will complete the reduction from $(n/2)$-Anonymity to $(cn)$-Anonymity.

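On one hand, let $\mathcal{P} = \{T_1, T_2\}$ be an $(n/2)$-anonymous partition of $T$ with $\mathrm{cost}(\mathcal{P}) = \mathrm{OPT}$. We have $|T_1| = |T_2| = n/2$. Define a partition $\mathcal{P}'$ of $T'$ as $\{T'_1, T'_2, T'_3\}$, where $T'_p = \{T'[i] \mid T[i] \in T_p, 1 \le i \le n\}$ for $p \in \{1, 2\}$, and $T'_3 = \{T'[i] \mid n + 1 \le i \le n'\}$. We have $|T'_1| = |T'_2| = n/2 = c(n/2c) = cn'$, and $|T'_3|/n' = (n/2c - n)/(n/2c) = 1 - 2c \ge c$ as $c \le 1/3$. Hence $\mathcal{P}'$ is a $(cn')$-anonymous partition of $T'$. It is easy to verify that $\mathrm{cost}(\mathcal{P}') = \mathrm{cost}(\mathcal{P}) = \mathrm{OPT}$ (the appended columns are constant within $T'_1$ and $T'_2$, and $T'_3$ consists of identical rows), implying that $\mathrm{OPT}' \le \mathrm{OPT}$.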

On the other hand, let $\mathcal{P}' = \{T'_1, \ldots, T'_r\}$ be a $(cn')$-anonymous partition of $T'$ with $\mathrm{cost}(\mathcal{P}') = \mathrm{OPT}'$. For simplicity of expression, we call $T'[i]$ an old row if $1 \le i \le n$, and a new row if $n + 1 \le i \le n'$. First assume that there exists $T'_p \in \mathcal{P}'$ that contains both an old row and a new row. By our construction of $T'$, an old row and a new row differ in all the last $nm$ coordinates, and thus the cost for generalizing $T'_p$ is at least $2nm$. Since $\mathrm{OPT} \le nm$, we have $\mathrm{OPT}' > \mathrm{OPT}$ in this case, which cannot happen since we already proved $\mathrm{OPT}' \le \mathrm{OPT}$. Therefore, every $T'_p \in \mathcal{P}'$ either contains only old rows or contains only new rows. Assume w.l.o.g. that $T'_1, \ldots, T'_{r'}$ are the sub-tables in $\mathcal{P}'$ that contain only old rows. Since all new rows are identical by our construction, we have

\[
\mathrm{OPT}' = \mathrm{cost}(\mathcal{P}') = \sum_{p=1}^{r'} \mathrm{cost}(T'_p). \tag{3}
\]

We now define a partition of $T$ as $\mathcal{P} = \{T_1, \ldots, T_{r'}\}$, where $T_p = \{T[i] \mid T'[i] \in T'_p\}$ for all $1 \le p \le r'$. Because $|T_p| = |T'_p| \ge cn' = c(n/2c) = n/2$, $\mathcal{P}$ is an $(n/2)$-anonymous partition of $T$. As the last $nm$ columns are identical for all old rows of $T'$, we have $\mathrm{OPT} \le \mathrm{cost}(\mathcal{P}) = \sum_{p=1}^{r'} \mathrm{cost}(T_p) = \sum_{p=1}^{r'} \mathrm{cost}(T'_p) = \mathrm{OPT}'$ by Equation (3). Combined with $\mathrm{OPT}' \le \mathrm{OPT}$ obtained previously, we obtain that $\mathrm{OPT} = \mathrm{OPT}'$, and that an optimal $(cn')$-anonymous partition of $T'$ can be easily transformed to an optimal $(n/2)$-anonymous partition of $T$. This finishes the reduction from $(n/2)$-Anonymity to $(cn)$-Anonymity, and completes the proof of Theorem 4. □

So far we have shown the hardness of $(cn)$-Anonymity for $c \in (0, 1/3] \cup \{1/2\}$. For the remaining case $c \in (1/3, 1/2)$ we need a different reduction.

‡Here we assume $n/2c$ is an integer; otherwise we can use $\lceil n/2c \rceil$ instead and get the same result, with more tedious analyses. Similar issues appear also in other proofs, which we will not mention again.

Theorem 5. For any constant $c$ such that $1/3 < c < 1/2$, $(cn)$-Anonymity is NP-hard.

Proof. Fix $1/3 < c < 1/2$. We will present a polynomial reduction from the following problem to $(cn)$-Anonymity: given an undirected graph $G = (V, E)$, decide whether $G$ contains a clique (i.e., a subgraph in which every pair of vertices have an edge between them) that contains exactly $|V|/2$ vertices. Call this problem HalfClique. The NP-hardness of HalfClique follows easily from that of the classical Clique problem, which we reduce to HalfClique as follows. Given a graph $G = (V, E)$ and an integer $k \le |V|$, the Clique problem asks whether $G$ contains a clique with exactly $k$ vertices. This is a well-known NP-hard problem [12]. Now construct another graph $G'$ based on $G$ as follows: if $k > |V|/2$, then add $2k - |V|$ new isolated vertices to $V$; if $k < |V|/2$, then add $|V| - 2k$ new vertices to $V$ and connect them with each other as well as with all original vertices in $V$. Let $V'$ be the new vertex set. It is easy to verify that $G$ has a clique of size $k$ if and only if $G'$ has a clique of size $|V'|/2$, which completes the reduction.
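The padding step from Clique to HalfClique can be sketched in a few lines of Python. This is only an illustration under our own conventions: vertices are numbered $1, \ldots, n$, edges are stored as frozensets, and the function names `pad_to_half_clique` and `has_clique` are hypothetical helpers (the latter is a brute-force check meant for tiny graphs only).

```python
from itertools import combinations

def pad_to_half_clique(n, edges, k):
    """Return (n', E') such that G has a k-clique iff G' has a clique on n'/2 vertices."""
    new_edges = {frozenset(e) for e in edges}
    if 2 * k >= n:
        # Add 2k - n isolated vertices (none if k = n/2).
        return 2 * k, new_edges
    # Add n - 2k vertices, each joined to all original and all other new vertices.
    new_n = 2 * n - 2 * k
    for u in range(n + 1, new_n + 1):
        for v in range(1, u):
            new_edges.add(frozenset((u, v)))
    return new_n, new_edges

def has_clique(n, edges, size):
    """Brute-force test for a clique with exactly `size` vertices."""
    edge_set = {frozenset(e) for e in edges}
    return any(all(frozenset(p) in edge_set for p in combinations(c, 2))
               for c in combinations(range(1, n + 1), size))

# Example: a triangle on {1, 2, 3} plus the edge {4, 5}.
example_edges = [(1, 2), (2, 3), (1, 3), (4, 5)]
padded_n, padded_edges = pad_to_half_clique(5, example_edges, 2)
print(padded_n, has_clique(padded_n, padded_edges, padded_n // 2))
```

In both branches the padded graph has an even number of vertices, and a $k$-clique of $G$ corresponds to a clique on exactly half of the vertices of $G'$.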

Let $G = (V, E)$ be an input graph of HalfClique with $|V| = n \ge 4$ and $|E| = m$. Assume $V = \{v_1, \ldots, v_n\}$ and $E = \{e_1, \ldots, e_m\}$. We construct a table $T$ with $n' = n/2c$ rows and $m$ QI columns as follows. For $1 \le i \le n$ and $1 \le j \le m$, let $T[i][j] = i$ if $v_i \in e_j$, and $T[i][j] = 0$ otherwise. For $n + 1 \le i \le n'$ and $1 \le j \le m$, let $T[i][j] = i$. (Note that, in some sense, this construction can be seen as a combination of those used in the proofs of Theorems 3 and 4; however, the analysis will be different and more intriguing.)

We first prove a result regarding the structure of an optimal $(cn')$-anonymous partition of $T$. Call $T[i]$ an old row if $1 \le i \le n$, and a new row if $n + 1 \le i \le n'$. We assume that $n/2c - n \ge 2$, i.e., $T$ contains at least two new rows; this is without loss of generality because $c$ is a constant smaller than $1/2$. Since $c > 1/3$, any $(cn')$-anonymous partition contains at most two groups. The trivial partition that consists of $T$ itself needs to suppress every coordinate in the table, because a new row and an old row do not share common values. Therefore, the minimum cost $(cn')$-anonymous partition of $T$ contains exactly two groups.

Denote by $\mathrm{OPT}$ the minimum cost of a $(cn')$-anonymous partition of $T$. We claim that $G$ contains a clique of size $n/2$ if and only if $\mathrm{OPT} \le n'm - (n/2)\binom{n/2}{2}$. First consider the "only if" part. Assume $V_2 \subseteq V$ is a clique of size $n/2$, and let $V_1 = V \setminus V_2$. Then $|V_2| = |V_1| = n/2$. For $p, q \in \{1,2\}$, denote by $E_{pq}$ the set of edges with one endpoint in $V_p$ and another in $V_q$. We define a partition $\mathcal{P} = \{T_1, T_2\}$ of $T$ by letting $T_1 = \{T[i] \mid v_i \in V_1\}$ and $T_2 = T \setminus T_1$. Since $|T_1| = n/2 = c(n/2c) = cn'$ and $|T_2|/n' = (n/2c - n/2)/(n/2c) = 1 - c > c$, $\mathcal{P}$ is a $(cn')$-anonymous partition. Similar to the proof of Theorem 3, we have $\mathrm{cost}(T_1) = n(|E_{11}| + |E_{12}|)/2$ (see Equation (1) and its proof). Since $T_2$ contains both old and new rows, we have $\mathrm{cost}(T_2) = |T_2| \cdot m = (n' - n/2)m$. Therefore,

\[
\begin{aligned}
\mathrm{OPT} \le \mathrm{cost}(\mathcal{P}) &= \mathrm{cost}(T_1) + \mathrm{cost}(T_2) \\
&= n(|E_{11}| + |E_{12}|)/2 + (n' - n/2)m \\
&= n(m - |E_{22}|)/2 + (n' - n/2)m \\
&= n'm - (n/2)|E_{22}| \\
&= n'm - (n/2)\binom{n/2}{2},
\end{aligned}
\]

where the last equality holds because $V_2$ is a clique of size $n/2$. This proves the "only if" part of the claim.

Next we consider the "if" direction. Let $\mathcal{P} = \{T_1, T_2\}$ be a $(cn')$-anonymous partition with $\mathrm{cost}(\mathcal{P}) = \mathrm{OPT} \le n'm - (n/2)\binom{n/2}{2}$. As argued before, every sub-table that contains both old and new rows needs to be suppressed totally. Thus, if both $T_1$ and $T_2$ contain both old and new rows, then $\mathrm{cost}(\mathcal{P}) = n'm$, which is the worst possible. In this case we can change $T_1$ to be any set of $n/2$ old rows and let $T_2 = T \setminus T_1$ to obtain a $(cn')$-anonymous partition with no worse cost. Therefore, in what follows we assume w.l.o.g. that $T_1$ consists of only old rows.

Let $V_1 = \{v_i \mid T[i] \in T_1\}$ and $V_2 = V \setminus V_1$. Define $E_{pq}$ analogously as before for $p, q \in \{1, 2\}$. Similar to the proof of Theorem 3, we have $\mathrm{cost}(T_1) = |V_1|(|E_{11}| + |E_{12}|)$ (just replace $n/2$ with $|V_1|$ in Equation (1)). Also $\mathrm{cost}(T_2) = |T_2| \cdot m = (n' - |V_1|)m$ since $T_2$ contains both old and new rows. Hence, $\mathrm{cost}(\mathcal{P}) = |V_1|(|E_{11}| + |E_{12}|) + (n' - |V_1|)m = |V_1|(m - |E_{22}|) + (n' - |V_1|)m = n'm - |V_1| \cdot |E_{22}|$. On the other hand, $\mathrm{cost}(\mathcal{P}) = \mathrm{OPT} \le n'm - (n/2)\binom{n/2}{2}$. Thus we have $|V_1| \cdot |E_{22}| \ge (n/2)\binom{n/2}{2}$. As $|V_1| + |V_2| = n$ and $|E_{22}| \le \binom{|V_2|}{2}$, we obtain that

\[
(n - |V_2|)\binom{|V_2|}{2} \ge |V_1| \cdot |E_{22}| \ge \frac{n}{2}\binom{n/2}{2}. \tag{4}
\]

Because $|V_1| = |T_1| \ge cn' = n/2$, we have $|V_2| \le n/2$. Define a function $f : [0, n/2] \to \mathbb{R}$ as $f(x) = (n - x)\binom{x}{2} = (n - x)x(x - 1)/2$ for all $0 \le x \le n/2$. Then Equation (4) indicates that $f(|V_2|) \ge f(n/2)$. Since $f(0) = 0$, $|V_2| \ge 1$ holds. Let $f'$ be the derivative of $f$ with respect to $x$. It is easy to verify that $f'(x) = \frac{1}{2}(-3x^2 + 2(n + 1)x - n)$. The minimum value of $f'(x)$ when $1 \le x \le n/2$ can only be obtained at $x \in \{1, n/2, (n + 1)/3\}$. Simple calculations show that $f'(1)$, $f'(n/2)$, and $f'((n + 1)/3)$ are all positive. Hence $f'(x) > 0$ for all $1 \le x \le n/2$, which means that $f(x)$ is strictly monotone increasing on $[1, n/2]$. Since we know that $f(|V_2|) \ge f(n/2)$ and that $1 \le |V_2| \le n/2$, it must hold that $|V_2| = n/2$. Therefore (4) holds with two equalities. We thus have $|E_{22}| = \binom{n/2}{2}$, implying that $V_2$ is a clique of size $n/2$.
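The monotonicity claim can be spot-checked numerically. The sketch below is only a check on integer points for even $n$ up to 100, not a proof; it works with $2f$ and $2f'$ to stay in exact integer arithmetic, and the function names are ours. Since $f'$ is a concave quadratic, positivity at the two endpoints $x = 1$ and $x = n/2$ already implies positivity on the whole interval.

```python
def f_times_two(n, x):
    """2 * f(x) = (n - x) * x * (x - 1), kept in integers."""
    return (n - x) * x * (x - 1)

def derivative_times_two(n, x):
    """2 * f'(x) = -3x^2 + 2(n + 1)x - n."""
    return -3 * x * x + 2 * (n + 1) * x - n

def check(n):
    """Check f is strictly increasing on the integers 1..n/2 and f' > 0 at both endpoints."""
    half = n // 2
    increasing = all(f_times_two(n, x) < f_times_two(n, x + 1) for x in range(1, half))
    positive_ends = derivative_times_two(n, 1) > 0 and derivative_times_two(n, half) > 0
    return increasing and positive_ends

all_ok = all(check(n) for n in range(4, 101, 2))
print(all_ok)
```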

We have shown that $G$ has a clique of size $n/2$ if and only if $T$ has a $(cn')$-anonymous partition of cost at most $n'm - (n/2)\binom{n/2}{2}$. This completes the reduction from HalfClique to $(cn)$-Anonymity, from which Theorem 5 follows. □

Now Theorem 2 follows straightforwardly from Theorems 3, 4, and 5. Interestingly, the three reductions work for disjoint ranges of $c$, which altogether give the desired result. By Lemma 1 we obtain:

Corollary 1. For any constant $t$ such that $1/2 \le t < 1$, $t$-Closeness is NP-hard even with equal-distance space.

We next show the hardness of $t$-Closeness for $0 \le t < 1/2$ by two different reductions from the 3-dimensional matching problem, each of which covers a different range of $t$.

Theorem 6. For any constant $t$ such that $0 \le t < 1/3$, $t$-Closeness is NP-hard even if $|\Sigma_s| = 3$.

Proof. Fix $0 \le t < 1/3$. We perform a polynomial-time reduction from the 3-dimensional matching problem (3D-Matching) to $t$-Closeness. The input of 3D-Matching consists of three equal-sized pairwise-disjoint sets $X$, $Y$, and $Z$, together with a collection $S$ of 3-tuples from $X \times Y \times Z$. The goal is to decide whether there exists a set of $|X|$ tuples from $S$ that covers each element of $X \cup Y \cup Z$ exactly once. This problem is well known to be NP-hard [12].

Consider an instance of 3D-Matching. Assume $|X| = |Y| = |Z| = n$, $U = X \cup Y \cup Z = \{v_1, v_2, \ldots, v_{3n}\}$, and the set of tuples is $S = \{e_1, e_2, \ldots, e_m\}$. Each tuple in $S$ is regarded as a subset of $U$ of size 3. The reduction that we will use is similar to that in [31]. We construct an instance of $t$-Closeness as follows. The table $T$ has $3n$ rows and $m$ QI columns as well as an SA column. For every $1 \le i \le 3n$ and $1 \le j \le m$, let $T[i][j] = i$ if $v_i \notin e_j$ and $T[i][j] = 0$ if $v_i \in e_j$. Let $T[i][m + 1]$ be 1, 2, or 3, if $v_i$ belongs to $X$, $Y$, or $Z$, respectively. The SA space is the equal-distance space. Notice that each QI column of $T$ contains exactly three zeros, corresponding to the three elements in the tuple associated with this column. Also note that the SA distribution of $T$ is $P(T) = (1/3, 1/3, 1/3)$ since $|X| = |Y| = |Z|$.

We will prove that there exist $n$ tuples of $S$ whose union equals $U = X \cup Y \cup Z$ if and only if $T$ has a $t$-closeness partition of cost at most $3n(m - 1)$. This will complete the reduction from 3D-Matching to $t$-Closeness.

First consider the "only if" direction. Assume that there exists $S' \subseteq S$, $|S'| = n$, such that $\bigcup_{e \in S'} e = U$. We assume w.l.o.g. that $S' = \{e_1, e_2, \ldots, e_n\}$. Define a partition $\mathcal{P}$ of $T$ as follows: $\mathcal{P} = \{T_1, \ldots, T_n\}$, where $T_p = \{T[i] \mid v_i \in e_p\}$ for all $1 \le p \le n$. Clearly $|T_1| = \ldots = |T_n| = 3$. Since each $e_p$ contains exactly one element from each of $X$, $Y$ and $Z$, we have $P(T_p) = (1/3, 1/3, 1/3) = P(T)$. Hence $\mathcal{P}$ is a $t$-closeness (and in fact 0-closeness) partition of $T$. By our construction, for each $p \in [n]$, the $p$-th column of $T_p$ consists of three zeros, and every other column contains at least two different QI values. Thus $\mathrm{cost}(T_p) = 3(m - 1)$, and $\mathrm{cost}(\mathcal{P}) = \sum_{p=1}^{n} \mathrm{cost}(T_p) = 3n(m - 1)$. The "only if" direction is proved.

We next consider the "if" direction. Let $\mathcal{P} = \{T_1, \ldots, T_r\}$ be a $t$-closeness partition of $T$ with cost at most $3n(m - 1)$. We claim that $|T_p| \ge 3$ for all $p \in [r]$. Assume to the contrary that $|T_p| \le 2$ for some $p$. Then $P(T_p)$ is either $(0, 1/2, 1/2)$ or $(0, 0, 1)$ up to permutations of the coordinates. It is easy to verify that $\mathrm{EMD}(P(T_p), P(T)) \ge 1/3 > t$ in both cases, which contradicts the fact that $\mathcal{P}$ is a $t$-closeness partition. Hence, $|T_p| \ge 3$. If $|T_p| \ge 4$, then $\mathrm{cost}(T_p) = |T_p| \cdot m$, because each column of $T$ consists of three zeros and $3n - 3$ distinct non-zero values and thus needs to be suppressed entirely in $T_p$. If $|T_p| = 3$, then $\mathrm{cost}(T_p) = 3(m - 1)$ if there is a tuple in $S$ that contains the three elements associated with the vectors in $T_p$ (in which case the column corresponding to this tuple need not be suppressed), and $\mathrm{cost}(T_p) = 3m$ otherwise. Every row thus incurs cost at least $m - 1$, with equality only in a group of size 3 that induces a tuple. Since $\mathrm{cost}(\mathcal{P}) \le 3n(m - 1)$, every group $T_p$ is of size 3 and induces a tuple, say $e_p \in S$. Then $\{e_1, \ldots, e_r\}$, where $r = n$, is a set of $n$ tuples whose union equals $U$, proving the "if" direction. This completes the reduction from 3D-Matching to $t$-Closeness, and Theorem 6 follows. □
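To make the construction concrete, here is a minimal Python sketch that builds the table for a toy 3D-Matching instance and checks that the partition induced by a perfect matching has cost $3n(m - 1)$ and SA distribution $(1/3, 1/3, 1/3)$ in every group. The element names, the instance, and the helper functions are illustrative assumptions, not part of the original construction.

```python
from fractions import Fraction

def build_3dm_table(elements, tuples):
    """elements: list of (name, sa) with sa in {1, 2, 3}; tuples: list of 3-element sets.
    Row i (1-based) has QI entry 0 in column j if v_i is in e_j, and i otherwise."""
    rows = []
    for i, (name, sa) in enumerate(elements, start=1):
        qi = tuple(0 if name in e else i for e in tuples)
        rows.append((name, qi, sa))
    return rows

def group_cost(group):
    """Suppression cost: each non-constant QI column is starred in every row."""
    m = len(group[0][1])
    starred = sum(1 for j in range(m) if len({qi[j] for _, qi, _ in group}) > 1)
    return starred * len(group)

def sa_distribution(group, values=(1, 2, 3)):
    return tuple(Fraction(sum(1 for _, _, s in group if s == v), len(group)) for v in values)

# Toy instance with n = 2 and m = 3; the first two tuples form a perfect matching.
elements = [("x1", 1), ("x2", 1), ("y1", 2), ("y2", 2), ("z1", 3), ("z2", 3)]
tuples = [{"x1", "y1", "z1"}, {"x2", "y2", "z2"}, {"x1", "y2", "z2"}]
table = build_3dm_table(elements, tuples)
matching = tuples[:2]
partition = [[row for row in table if row[0] in e] for e in matching]
total_cost = sum(group_cost(g) for g in partition)
distributions = [sa_distribution(g) for g in partition]
print(total_cost, distributions)
```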

Finally we come to the last part, $t \in [1/3, 1/2)$.

Theorem 7. For any constant $t$ such that $1/3 \le t < 1/2$, $t$-Closeness is NP-hard even if $|\Sigma_s| = 4$.

Proof. Fix $1/3 \le t < 1/2$. We give a reduction from 3D-Matching to $t$-Closeness similar to that used in the proof of Theorem 6, with some more ingredients. Consider an instance of 3D-Matching. The element set is $U = X \cup Y \cup Z = \{v_1, v_2, \ldots, v_{3n}\}$ where $|X| = |Y| = |Z| = n$. The tuple set is $S = \{e_1, \ldots, e_m\}$ where each $e_i$, $1 \le i \le m$, is a subset of $U$ of size 3 that consists of exactly one element from each of $X$, $Y$, and $Z$. The goal is to decide whether there exists $S' \subseteq S$, $|S'| = n$, such that $\bigcup_{e \in S'} e = U$.

We set up an instance of $t$-Closeness as follows. The table $T$ consists of $n' = 3n/(1 - 2t)$ rows, $m$ QI columns, and an SA column. For all $1 \le i \le 3n$ and $1 \le j \le m$, $T[i][j] = i$ if $v_i \notin e_j$ and $T[i][j] = 0$ if $v_i \in e_j$. For $1 \le i \le 3n$, $T[i][m + 1] = 1$ if $v_i \in X$, $T[i][m + 1] = 2$ if $v_i \in Y$, and $T[i][m + 1] = 3$ if $v_i \in Z$. For $3n + 1 \le i \le n'$, $T[i][j] = i$ for $1 \le j \le m$, and $T[i][m + 1] = 4$. Note that $\Sigma_s = \{1, 2, 3, 4\}$. Define the distance function of the SA space as $d(1, 2) = d(1, 3) = d(2, 3) = 1$ and $d(4, 1) = d(4, 2) = d(4, 3) = 1/2$; this clearly forms a metric on $\Sigma_s$. It is easy to verify that $P(T) = \left(\frac{1 - 2t}{3}, \frac{1 - 2t}{3}, \frac{1 - 2t}{3}, 2t\right)$. The goal is to decide whether $T$ has a $t$-closeness partition. Before showing the correctness of the reduction, we present a formula for computing the EMD between two distributions under this metric. Let $A = (a_1, a_2, a_3, a_4)$ and $B = (b_1, b_2, b_3, b_4)$ be two SA distributions with $a_4 \ge b_4$. Then,

\[
\mathrm{EMD}(A, B) = \frac{1}{2}(a_4 - b_4) + \sum_{i \in \{1,2,3\} : a_i > b_i} (a_i - b_i). \tag{5}
\]

This can be seen as follows. Let $S_> = \{1 \le i \le 4 \mid a_i > b_i\}$ and $S_\le = \{1, 2, 3, 4\} \setminus S_>$. We have $4 \in S_>$ and $\sum_{i \in S_>} (a_i - b_i) = \sum_{j \in S_\le} (b_j - a_j)$. To transform $A$ to $B$, we need to move $M = \sum_{i \in S_>} (a_i - b_i)$ amount of mass from $S_>$ to $S_\le$. The $a_4 - b_4$ amount of mass at point 4 can be moved out by distance $1/2$, while the remaining amount must be moved by distance 1. Therefore

\[
\mathrm{EMD}(A, B) = \frac{1}{2}(a_4 - b_4) + \sum_{i \in S_> \setminus \{4\}} (a_i - b_i) = \frac{1}{2}(a_4 - b_4) + \sum_{i \in \{1,2,3\} : a_i > b_i} (a_i - b_i).
\]
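Equation (5) is easy to evaluate with exact rational arithmetic. The following sketch is an illustration only: the function name `emd_formula` and the sample value $t = 2/5$ are our assumptions, and the swap in the first lines relies on the symmetry of EMD under a metric ground distance. It evaluates the formula on the group distributions that appear in the proof of Theorem 7.

```python
from fractions import Fraction

def emd_formula(A, B):
    """Equation (5): SA values 1-3 are pairwise at distance 1, value 4 at distance 1/2 from each."""
    if A[3] < B[3]:
        A, B = B, A  # the formula assumes a_4 >= b_4; EMD is symmetric
    surplus = sum(a - b for a, b in zip(A[:3], B[:3]) if a > b)
    return Fraction(1, 2) * (A[3] - B[3]) + surplus

t = Fraction(2, 5)                                   # sample value in [1/3, 1/2)
p_table = ((1 - 2 * t) / 3,) * 3 + (2 * t,)          # P(T) in the reduction
third, half = Fraction(1, 3), Fraction(1, 2)

emd_triple = emd_formula((third, third, third, 0), p_table)   # a group induced by a tuple
emd_new_row = emd_formula((0, 0, 0, 1), p_table)              # a single new row
emd_pair = emd_formula((half, half, 0, 0), p_table)           # two old rows
print(emd_triple, emd_new_row, emd_pair)
```

For this value of $t$ the first two distances are at most $t$, while the third exceeds $t$, matching the case analysis in the proof.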

We prove that the answer to the matching instance is yes if and only if $T$ has a $t$-closeness partition of cost at most $3n(m - 1)$. First consider the "only if" direction. Assume w.l.o.g. that $S' = \{e_1, \ldots, e_n\}$ satisfies $\bigcup_{e \in S'} e = U$. Define a partition $\mathcal{P} = \{T_1, T_2, \ldots, T_n\} \cup \{T'_{3n+1}, T'_{3n+2}, \ldots, T'_{n'}\}$, where $T_p = \{T[i] \mid v_i \in e_p\}$ for $1 \le p \le n$ and $T'_p = \{T[p]\}$ for $3n + 1 \le p \le n'$, i.e., each $T'_p$ consists of a single row. By similar arguments as in the proof of Theorem 6, $\mathrm{cost}(T_p) = 3(m - 1)$ for $1 \le p \le n$, and obviously $\mathrm{cost}(T'_p) = 0$ for $3n + 1 \le p \le n'$. Hence $\mathrm{cost}(\mathcal{P}) = 3n(m - 1)$. It remains to show that $\mathcal{P}$ is a $t$-closeness partition. For $1 \le p \le n$, $P(T_p) = (1/3, 1/3, 1/3, 0)$, and by Equation (5) we have $\mathrm{EMD}(P(T_p), P(T)) = \frac{1}{2} \cdot 2t = t$. Since $P(T'_q) = (0, 0, 0, 1)$ for all $3n + 1 \le q \le n'$, $\mathrm{EMD}(P(T'_q), P(T)) = \frac{1}{2}(1 - 2t) \le t$ as $t \ge 1/3$ (actually this holds for all $t \ge 1/4$). This proves that $\mathcal{P}$ is a $t$-closeness partition, and hence the "only if" direction.

Now consider the "if" direction. Let $\mathcal{P} = \{T_1, \ldots, T_r\}$ be a $t$-closeness partition of $T$ with cost at most $3n(m - 1)$. Call $T[i]$ an old row if $1 \le i \le 3n$, and a new row if $i > 3n$. By our construction of $T$, it is clear that $\mathrm{cost}(T_p) = |T_p| \cdot m$ if $|T_p| \ge 2$ and $T_p$ contains at least one new row. Now let $T_p$ be a group containing only old rows. If $|T_p| \le 2$, then $P(T_p)$ is equivalent to $(1, 0, 0, 0)$ or $(1/2, 1/2, 0, 0)$ up to permutations of the first three coordinates. By (5) and the fact that $t < 1/2$, we can verify that $\mathrm{EMD}(P(T_p), P(T)) > t$ in both cases. Therefore $|T_p| \ge 3$. Analogous to the proof of Theorem 6, we know that $\mathrm{cost}(T_p) = 3(m - 1)$ if $T_p$ consists of three old rows corresponding to three elements in the same tuple, and $\mathrm{cost}(T_p) = |T_p| \cdot m$ otherwise. Thus for $\mathrm{cost}(\mathcal{P}) \le 3n(m - 1)$ it must be the case that there exist $n$ groups each of which consists of three old rows, and each of the remaining groups consists of exactly one new row. As groups are disjoint, they together cover all the $3n$ old rows, which naturally induces $n$ tuples of $S$ whose union equals $U$. The "if" direction is thus proved. This completes the reduction from 3D-Matching to $t$-Closeness, and Theorem 7 follows. □

4 Exact and Fixed-Parameter Algorithms

In this section we design exact algorithms for solving $t$-Closeness. Notice that the size of an instance of $t$-Closeness is polynomial in $n$ and $m + 1$. The brute-force approach that examines each possible partition of the table to find the optimal solution takes $n^{O(n)} m^{O(1)} = 2^{O(n \log n)} m^{O(1)}$ time. We first improve this bound to single exponential in $n$. (Note that it cannot be improved to polynomial unless P = NP.)

Theorem 8. The $t$-Closeness problem can be solved in $2^{O(n)} \cdot O(m)$ time.

Proof. Consider an input table $T$ of the $t$-Closeness problem. Assume that $\mathcal{P} = \{T_1, \ldots, T_r\}$ is an optimal $t$-closeness partition of $T$ (note that we do not know $\mathcal{P}$; it is only used for analysis). Obviously there is at most one group $T_p$ with $|T_p| > n/2$. We claim that, if $|T_p| \le n/2$ for all $p \in [r]$, then there is a partition $(A_1, A_2)$ of $\{1, 2, \ldots, r\}$ such that $n/4 \le |\bigcup_{p \in A_i} T_p| \le 3n/4$ for any $i \in \{1, 2\}$. This can be seen as follows. Denote $n(A) = |\bigcup_{p \in A} T_p|$ for any $A \subseteq [r]$. Let $(A_1, A_2)$ be the partition of $[r]$ that minimizes $|n(A_1) - n(A_2)|$. Assume w.l.o.g. that $n(A_1) \le n(A_2)$. If $n(A_2) \le 3n/4$, the claim is proved. Otherwise, $A_2$ contains at least two groups, and we move an arbitrary group from $A_2$ to $A_1$, resulting in a new partition $(A'_1, A'_2)$. If $n(A'_1) \le n(A'_2)$, then $|n(A'_1) - n(A'_2)| = n(A'_2) - n(A'_1) < n(A_2) - n(A_1) = |n(A_1) - n(A_2)|$, which contradicts the way in which $(A_1, A_2)$ is chosen. We thus have $n(A'_1) > n(A'_2)$, and so $n(A'_1) > n/2$. Since each group has size at most $n/2$, we have $n/2 < n(A'_1) \le n(A_1) + n/2 < 3n/4$, and hence $n/2 > n(A'_2) = n - n(A'_1) > n/4$. This proves the claim.
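The existence of such a balanced split can also be seen constructively. The sketch below uses a simple greedy argument, which differs from the minimization argument in the proof: keep adding groups until the running total reaches $n/4$; because every group has size at most $n/2$, the total cannot exceed $3n/4$. The function name `balanced_split` is ours.

```python
def balanced_split(sizes):
    """Given group sizes, each at most n/2 with n = sum(sizes), return (indices, total)
    such that n/4 <= total <= 3n/4, where total is the combined size of the chosen groups."""
    n = sum(sizes)
    if any(2 * s > n for s in sizes):
        raise ValueError("some group is larger than n/2")
    chosen, total = [], 0
    for idx, s in enumerate(sizes):
        if 4 * total >= n:      # already at least n/4: stop
            break
        chosen.append(idx)
        total += s              # previous total < n/4 and s <= n/2, so total < 3n/4
    return chosen, total

example_sizes = [5, 1, 1, 1, 2]
chosen, total = balanced_split(example_sizes)
print(chosen, total)
```

The complement of the chosen index set then has total size between $n/4$ and $3n/4$ as well.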

For any $M \subseteq T$, let $\mathrm{OPT}(M)$ denote the minimum cost of any partition of $M$ in which each group is $t$-close to $T$; thus the optimal cost of the problem is $\mathrm{OPT}(T)$. We now have a natural recursive algorithm for computing $\mathrm{OPT}(T)$: Enumerate all $T_1 \subseteq T$ with $n/4 \le |T_1| \le 3n/4$ and find the one minimizing $\mathrm{OPT}(T_1) + \mathrm{OPT}(T \setminus T_1)$; denote this minimum cost by $\mathrm{OPT}_1$. We also exhaustively find $T'_1 \subseteq T$ with $|T'_1| \ge n/2$ that minimizes $\mathrm{OPT}(T'_1) + \mathrm{OPT}(T \setminus T'_1)$, which is denoted by $\mathrm{OPT}_2$. By our previous analysis, $\mathrm{OPT}(T) = \min\{\mathrm{OPT}_1, \mathrm{OPT}_2\}$ and thus we can solve $t$-Closeness by taking the better solution. Two notes on the recursive steps: (1) If we have a table of constant size (say, less than 10) then we can directly solve it in $O(m)$ time by the brute-force approach. (2) If we have a table $T'$ such that $\mathrm{EMD}(P(T'), P(T)) > t$ then we return with cost $+\infty$.

We now analyze the running time of the algorithm. Let $f(s)$ denote the running time on a sub-table of $T$ of size $s$. When $s < 10$ we have $f(s) = O(m)$, and when $s \ge 10$,

\[
\begin{aligned}
f(s) &\le \sum_{i=s/4}^{3s/4} \binom{s}{i} \cdot 2f(3s/4) + \sum_{i=s/2}^{s} \binom{s}{i} f(s/2) + O(2^s) \\
&\le 2^{s+2} f(3s/4) + O(2^s).
\end{aligned}
\]

In the first inequality, the first term stands for the time of enumerating $T_1$ with $n/4 \le |T_1| \le 3n/4$, the second term is for the enumeration of $T'_1$ with $|T'_1| \ge n/2$, and the third term is responsible for other work such as recording the subsets. It is easy to verify that this recursion gives $f(n) \le 2^{O(n)} \cdot O(m)$. □
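For readers who want to see why the recursion yields a single-exponential bound, one way to unroll it is sketched below. This is our own back-of-the-envelope derivation, ignoring rounding and assuming $f(s) \ge 1$, so that the additive $O(2^s)$ term can be absorbed into the multiplicative one with some constant $c$.

```latex
\[
f(s) \le 2^{s+2} f(3s/4) + c \cdot 2^{s} \le (4 + c)\, 2^{s} f(3s/4)
\]
\[
f(n) \le \prod_{i=0}^{k-1} (4 + c)\, 2^{(3/4)^{i} n} \cdot O(m)
     = (4 + c)^{k} \cdot 2^{\,n \sum_{i=0}^{k-1} (3/4)^{i}} \cdot O(m)
     \le (4 + c)^{k} \cdot 2^{4n} \cdot O(m)
\]
```

Here $k = O(\log n)$ is the recursion depth needed to reach constant size, so $(4 + c)^k = n^{O(1)}$ and the whole bound is $2^{O(n)} \cdot O(m)$.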

In many real applications, there are usually only a small number of attributes and distinct attribute values. Thus it is interesting to see whether $t$-Closeness can be solved more efficiently when $m$ and $|\Sigma|$ are small. We answer this question affirmatively in terms of fixed-parameter tractability.

Theorem 9. $t$-Closeness is fixed-parameter tractable when parameterized by $m$ and $|\Sigma|$. Thus we can solve $t$-Closeness optimally in polynomial time when $m$ and $|\Sigma|$ are constants.

Proof. Consider an input table $T$ with $n$ rows and $m + 1$ columns (of which $m$ are QIs and one is SA). For $v \in \Sigma^m$ and $s \in \Sigma_s$, denote by $R_{v,s}$ the set of vectors in $T$ that are identical to $(v, s)$, and let $r_{v,s} = |R_{v,s}|$. We thus have $\sum_{v \in \Sigma^m, s \in \Sigma_s} r_{v,s} = n$. We write an integer linear program to characterize the minimum cost of a $t$-closeness partition of $T$. For every $v \in \Sigma^m$ and $s \in \Sigma_s$ such that $R_{v,s} \ne \emptyset$, and every $v^* \in (\Sigma \cup \{*\})^m$ that generalizes $v$, there is a nonnegative integer variable $x(v^*, v, s)$ which means the number of vectors in $R_{v,s}$ that are generalized to $(v^*, s)$ in the partition. We clearly have

\[
\sum_{v^* :\, v^* \text{ generalizes } v} x(v^*, v, s) = r_{v,s}, \quad \forall (v, s) \text{ s.t. } R_{v,s} \ne \emptyset. \tag{6}
\]

Each $v^* \in (\Sigma \cup \{*\})^m$ induces a group, denoted $G_{v^*}$, which consists of all vectors whose QI values are generalized to $v^*$. Those groups together form a partition (note that some group may be empty). Denoting by $C_{v^*}$ the number of '*'s in $v^*$, the cost of the partition is precisely $\sum_{v^*, v, s} C_{v^*} \cdot x(v^*, v, s)$. Thus the objective function is

\[
\text{Minimize} \quad \sum_{v^*, v, s} C_{v^*} \cdot x(v^*, v, s). \tag{7}
\]

We still need other constraints to ensure that each group $G_{v^*}$ either is empty or has $t$-closeness. We do this by adding a set of constraints, for every $v^*$, that characterizes the transportation between the SA distributions of $G_{v^*}$ and $T$ as in the definition of EMD. First assume that $G_{v^*}$ is non-empty. We have $|G_{v^*}| = \sum_{v \in \Sigma^m, s \in \Sigma_s} x(v^*, v, s)$. The probability mass of $i \in \Sigma_s$ in $P(G_{v^*})$ is $\sum_{v \in \Sigma^m} x(v^*, v, i)/|G_{v^*}|$, and that in $P(T)$ is $\sum_{v \in \Sigma^m} r_{v,i}/n$. For $i, j \in \Sigma_s$, let $f(v^*, i, j)$ denote the amount of mass moved from $i$ to $j$ in order to transform $P(G_{v^*})$ to $P(T)$. Let $d_{ij}$ be the distance between $i$ and $j$ in the SA space. To guarantee the $t$-closeness of $G_{v^*}$ we can write the following constraints:

\[
\begin{aligned}
\sum_{j \in \Sigma_s} f(v^*, i, j) &= \sum_{v \in \Sigma^m} x(v^*, v, i)/|G_{v^*}|, && \forall i \in \Sigma_s \\
\sum_{i \in \Sigma_s} f(v^*, i, j) &= \sum_{v \in \Sigma^m} r_{v,j}/n, && \forall j \in \Sigma_s \\
\sum_{i, j \in \Sigma_s} d_{ij} \cdot f(v^*, i, j) &\le t \\
f(v^*, i, j) &\ge 0, && \forall i, j \in \Sigma_s.
\end{aligned}
\]

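The first constraint above is not linear. To overcome this, we define $g(v^*, i, j) = f(v^*, i, j) \cdot |G_{v^*}|$, substitute $g(v^*, i, j)$ for $f(v^*, i, j)$ in the above constraints, and expand $|G_{v^*}|$. This produces the following equivalent constraints:

\[
\begin{aligned}
\sum_{j \in \Sigma_s} g(v^*, i, j) &= \sum_{v \in \Sigma^m} x(v^*, v, i), && \forall i \in \Sigma_s \\
n \sum_{i \in \Sigma_s} g(v^*, i, j) &= \sum_{v \in \Sigma^m} r_{v,j} \cdot \sum_{v \in \Sigma^m, s \in \Sigma_s} x(v^*, v, s), && \forall j \in \Sigma_s \\
\sum_{i, j \in \Sigma_s} d_{ij} \cdot g(v^*, i, j) &\le t \cdot \sum_{v \in \Sigma^m, s \in \Sigma_s} x(v^*, v, s) \\
g(v^*, i, j) &\ge 0, && \forall i, j \in \Sigma_s.
\end{aligned}
\]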

Note that these constraints hold even if $G_{v^*}$ is empty. Thus they force group $G_{v^*}$ to be $t$-close or empty. The set of such constraints for all $v^*$, together with (6) and (7), compose a mixed integer linear program (i.e., only some of the variables are required to take integer values) that precisely characterizes the $t$-Closeness problem on $T$. The number of variables in the program is $N \le |\Sigma|^m (|\Sigma| + 1)^m |\Sigma_s| + (|\Sigma| + 1)^m |\Sigma_s|^2 \le 2(|\Sigma| + 1)^{2m+1}$. The time spent on constructing and writing down this linear program is polynomial in $n$, $m$, and $N$. By the result in [16]§ (Section 5 of it deals with mixed ILP), a mixed integer linear program with $N$ variables can be solved in $N^{O(N)} L$ time, where $L$ is the number of bits used to encode the program. In our case $L$ is polynomial in $n$ and $m$. Therefore, we can solve this program, and hence solve $t$-Closeness, in $h(m, |\Sigma|) n^{O(1)}$ time for some function $h$. This shows that $t$-Closeness is fixed-parameter tractable when parameterized by $m$ and $|\Sigma|$. □

§A technical issue here is that, in order to apply results for mixed integer linear programs, $t$ needs to be a rational number. Nevertheless, for irrational $t$ we can use rationals to approximate the value of $t$ to an arbitrary precision.



5 Approximation Algorithm for k-Anonymity

In this section we give a polynomial-time $m$-approximation algorithm for $k$-Anonymity, which improves the previous best ratios $O(k)$ [1] and $O(\log k)$ [23] when $k$ is relatively large compared with $m$. (We note that the $O(\log k)$-approximation algorithm given in [23] is not guaranteed to run in polynomial time for super-constant $k$, while our result holds for all $k$.)

Theorem 10. $k$-Anonymity can be approximated within factor $m$ in polynomial time.

Proof. Consider a table $T$ with $n$ rows and $m$ QI columns. Denote by $\mathrm{OPT}$ the minimum cost of any $k$-anonymous partition of $T$. Partition $T$ into "equivalence classes" $C_1, \ldots, C_R$ in the following sense: any two vectors in the same class are identical, i.e., they have the same value on each attribute, while any two vectors from different classes differ on at least one attribute. Assume $|C_1| \le |C_2| \le \ldots \le |C_R|$. If $|C_1| \ge k$, then these classes form a $k$-anonymous partition with cost 0, which is surely optimal. Thus we assume $|C_1| < k$, and let $L \in [R]$ be the maximum integer for which $|C_L| < k$. Then $|C_{L'}| \ge k$ for all $L < L' \le R$. It is clear that each vector in $C_1 \cup \ldots \cup C_L$ contributes at least one to the cost of any partition of $T$. Thus $\mathrm{OPT} \ge \sum_{i=1}^{L} |C_i|$.

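Case 1: $\sum_{i=1}^{L} |C_i| \ge k$. In this case we partition $T$ into $R - L + 1$ groups: $\{C_1 \cup \ldots \cup C_L, C_{L+1}, C_{L+2}, \ldots, C_R\}$. This is a $k$-anonymous partition of cost at most $m \cdot \sum_{i=1}^{L} |C_i| \le m \cdot \mathrm{OPT}$.

Case 2: $\sum_{i=1}^{L} |C_i| < k$ and $\sum_{i=1}^{L} |C_i| + \sum_{i=L+1}^{R} (|C_i| - k) \ge k$. We choose $C'_i \subseteq C_i$ for $L + 1 \le i \le R$ satisfying that $|C_i \setminus C'_i| \ge k$ and $\sum_{i=1}^{L} |C_i| + \sum_{i=L+1}^{R} |C'_i| = k$. This can be done because of the second condition of this case. We partition $T$ into $R - L + 1$ groups: $\{\bigcup_{i=1}^{L} C_i \cup \bigcup_{i=L+1}^{R} C'_i, C_{L+1} \setminus C'_{L+1}, \ldots, C_R \setminus C'_R\}$. This is a $k$-anonymous partition of cost at most $m \cdot k \le m \cdot \mathrm{OPT}$, since $\mathrm{OPT} \ge k$ (in any $k$-anonymous partition, the group containing a vector of $C_1$ has at least $k$ vectors that are not all identical, and each of them incurs cost at least one).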

Case 3: $\sum_{i=1}^{L} |C_i| + \sum_{i=L+1}^{R} (|C_i| - k) < k$. We claim that there exists $i \in \{L + 1, \ldots, R\}$ such that any vector in $C_i$ contributes at least one to the cost of any $k$-anonymous partition. Assume the contrary. Then there exists a $k$-anonymous partition such that, for every $L + 1 \le i \le R$, there is a vector $v \in C_i$ whose suppression cost is 0, which means that $v$ belongs to a group that only contains vectors in $C_i$; denote this group by $C'_i$. We also know that there is at least one group in the partition that has positive cost. However, by removing all $C'_i$, $L + 1 \le i \le R$, from $T$, the number of vectors left is at most $n - k(R - L) = \sum_{i=1}^{R} |C_i| - k(R - L) = \sum_{i=1}^{L} |C_i| + \sum_{i=L+1}^{R} (|C_i| - k) < k$, due to the condition of this case. This contradicts the property of $k$-anonymous partitions. Therefore the claim holds, i.e., there exists $j \in \{L + 1, \ldots, R\}$ such that any vector in $C_j$ contributes at least one to the partition cost. Thus we have $\mathrm{OPT} \ge \sum_{i=1}^{L} |C_i| + |C_j| \ge \sum_{i=1}^{L+1} |C_i|$. We partition $T$ into $R - L$ groups: $\{\bigcup_{i=1}^{L+1} C_i, C_{L+2}, \ldots, C_R\}$. This is a $k$-anonymous partition with cost at most $m \cdot \sum_{i=1}^{L+1} |C_i| \le m \cdot \mathrm{OPT}$.

By the above case analyses, we can always find in polynomial time a $k$-anonymous partition of $T$ with cost at most $m \cdot \mathrm{OPT}$. This completes the proof of Theorem 10. □

We note that Theorem 10 implies that $k$-Anonymity can be solved optimally in polynomial time when $m = 1$. This is in contrast to $\ell$-Diversity, which remains NP-hard when $m = 1$ (with unbounded $\ell$) [9].
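The case analysis in the proof of Theorem 10 translates almost directly into code. The following Python sketch is an illustration under our own conventions (rows are tuples of QI values, a group's cost is the number of non-constant columns times the group size, and the names `approx_k_anonymity` and `group_cost` are hypothetical); it returns the partition that the corresponding case prescribes.

```python
from collections import defaultdict

def group_cost(group):
    """Suppression cost: every non-constant column is starred in all rows of the group."""
    m = len(group[0])
    starred = sum(1 for j in range(m) if len({row[j] for row in group}) > 1)
    return starred * len(group)

def approx_k_anonymity(rows, k):
    """Return a k-anonymous partition (list of groups) following Cases 1-3 of the proof."""
    if len(rows) < k:
        raise ValueError("no k-anonymous partition exists when n < k")
    classes = defaultdict(list)
    for row in rows:
        classes[tuple(row)].append(tuple(row))
    ordered = sorted(classes.values(), key=len)          # |C_1| <= ... <= |C_R|
    small = [c for c in ordered if len(c) < k]           # C_1, ..., C_L
    big = [c for c in ordered if len(c) >= k]            # C_{L+1}, ..., C_R
    if not small:
        return big                                       # cost 0, optimal
    merged = [row for c in small for row in c]
    if len(merged) >= k:                                 # Case 1
        return [merged] + big
    if len(merged) + sum(len(c) - k for c in big) >= k:  # Case 2
        need, rest = k - len(merged), []
        for c in big:
            take = min(need, len(c) - k)                 # leave at least k rows in c
            merged += c[:take]
            rest.append(c[take:])
            need -= take
        return [merged] + rest
    return [merged + big[0]] + big[1:]                   # Case 3

example_rows = [(1, 1), (1, 1), (1, 1), (2, 2)]
example_partition = approx_k_anonymity(example_rows, 2)
example_cost = sum(group_cost(g) for g in example_partition)
print(example_partition, example_cost)
```

On the example, the single row `(2, 2)` borrows one copy of `(1, 1)` (Case 2), and the remaining two identical rows form a zero-cost group.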

6 Algorithm for 2-Diversity 

In this part we give the first polynomial time algorithm for solving 2-Diversity. Let T be an 
input table of 2-Diversity. The following lemma is crucial to our algorithm. 

Lemma 2. There is an optimal 2-diverse partition of T in which every group consists of 2 or 3 
vectors with distinct SA values. 

Proof. It suffices to show that any 2-diverse sub-table $M \subseteq T$ can be further partitioned into groups each of which consists of 2 or 3 vectors with distinct SA values (note that partitioning a group does not increase the generalization cost). We use induction on the size of $M$. When $|M| = 2$ or 3 it can be verified directly. Now consider $M \subseteq T$ of size at least 4. Suppose $M$ contains $k$ SA values $\{1, 2, \ldots, k\}$ where $k \ge 2$. Let $a_i$ be the number of vectors in $M$ with SA value $i$, for $i \in [k]$. Assume w.l.o.g. that $a_1 \ge a_2 \ge \ldots \ge a_k$. Let $A_1$ and $A_2$ be two vectors with SA value 1 and 2, respectively. Partition $M$ into $\{A_1, A_2\}$ and $M' = M \setminus \{A_1, A_2\}$. We only need to show that $M'$ is 2-diverse, so that we can use induction on it. We perform a case analysis as follows.

• $a_1 = 1$. Then $M'$ consists of at least two vectors with distinct SA values, and thus is 2-diverse.

• $k = 2$. Since $M$ is 2-diverse, we have $a_1 = a_2$. Then $M'$ still contains the same number of SA values 1 and 2, so it remains 2-diverse.

• $a_1 \ge 2$, $k \ge 3$, $a_1 > a_3$. The highest frequency of any SA value in $M'$ is $a_1 - 1 \le |M|/2 - 1 = |M'|/2$, and thus $M'$ is 2-diverse.

• $a_1 = a_3 \ge 2$, $k \ge 3$. In this case $|M| \ge 3a_3$. The highest frequency of an SA value in $M'$ is $a_3$. We have $|M'| - 2a_3 = |M| - 2 - 2a_3 \ge a_3 - 2 \ge 0$, so $M'$ is 2-diverse.

All the cases are covered above and hence Lemma 2 is proved. □
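The induction in the proof of Lemma 2 is effectively a peeling procedure: repeatedly remove one vector from each of the two most frequent SA values until at most three vectors remain. The sketch below applies it to a multiset of SA values only; it assumes, as the case analysis above does, that 2-diverse means no SA value occurs in more than half of the vectors, and the function name `split_2_diverse` is ours.

```python
from collections import Counter

def split_2_diverse(sa_values):
    """Split a 2-diverse multiset of SA values into groups of 2 or 3 distinct values."""
    counts = Counter(sa_values)
    n = len(sa_values)
    if n < 2 or 2 * max(counts.values()) > n:
        raise ValueError("input is not 2-diverse")
    groups = []
    while sum(counts.values()) > 3:
        # Peel off one vector from each of the two most frequent SA values.
        (v1, _), (v2, _) = counts.most_common(2)
        groups.append([v1, v2])
        for v in (v1, v2):
            counts[v] -= 1
            if counts[v] == 0:
                del counts[v]
    groups.append(list(counts.elements()))   # 2 or 3 vectors with distinct values
    return groups

example_values = [1, 1, 1, 2, 2, 3]
example_groups = split_2_diverse(example_values)
print(example_groups)
```

Splitting an actual sub-table would additionally pick concrete vectors for each SA value; by the note in the proof, any such refinement does not increase the generalization cost.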

Given Lemma 2, the rest of the proof is basically the same as that of the polynomial-time tractability of 2-Anonymity given in [3]. We restate the proof for completeness. We reduce 2-Diversity to a combinatorial problem called Simplex Matching introduced in [2], which admits a polynomial algorithm [2]. The input of Simplex Matching is a hypergraph $H = (V, E)$ containing edges of sizes 2 and 3 with nonnegative edge costs $c(e)$ for all edges $e \in E$. In addition $H$ is guaranteed to satisfy the following simplex condition: if $\{v_1, v_2, v_3\} \in E$, then $\{v_1, v_2\}, \{v_2, v_3\}, \{v_3, v_1\}$ are also in $E$, and $c(\{v_1, v_2\}) + c(\{v_2, v_3\}) + c(\{v_1, v_3\}) \le 2 \cdot c(\{v_1, v_2, v_3\})$. The goal is to find a perfect matching of $H$ (i.e., a set of edges that cover every vertex $v \in V$ exactly once) with minimum cost (which is the sum of costs of all chosen edges).

Let $T$ be an input table of 2-Diversity. We construct a hypergraph $H = (V, E)$ as follows. Let $V = \{v_1, v_2, \ldots, v_n\}$ where $v_i$ corresponds to the vector $T[i]$. For every two vectors $T[i], T[j]$ (or three vectors $T[i], T[j], T[k]$) with distinct SA values, there is an edge $e = \{v_i, v_j\}$ (or $e = \{v_i, v_j, v_k\}$) with cost equal to $\mathrm{cost}(\{T[i], T[j]\})$ (or $\mathrm{cost}(\{T[i], T[j], T[k]\})$). Consider any 3D edge $e = \{v_i, v_j, v_k\}$. Since each column that needs to be suppressed in $\{T[i], T[j]\}$ must also be suppressed in $\{T[i], T[j], T[k]\}$, we have $c(e)/3 \ge c(\{v_i, v_j\})/2$. Similarly, $c(e)/3 \ge c(\{v_i, v_k\})/2$ and $c(e)/3 \ge c(\{v_j, v_k\})/2$. Summing the inequalities up gives $2c(e) \ge c(\{v_i, v_j\}) + c(\{v_i, v_k\}) + c(\{v_j, v_k\})$. Therefore $H$ satisfies the simplex condition, and it clearly can be constructed in polynomial time. Call a 2-diverse partition of $T$ good if every group in it consists of 2 or 3 vectors with distinct SA values. Lemma 2 shows that there is an optimal 2-diverse partition that is good. By the construction of $H$, each good 2-diverse partition of $T$ can be easily transformed to a perfect matching of $H$ with the same cost, and vice versa. Hence, we can find an optimal 2-diverse partition of $T$ by using the polynomial time algorithm for Simplex Matching [2]. We thus have:

Theorem 11. 2-Diversity is solvable in polynomial time.
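The simplex condition used in the reduction above can be checked mechanically on small tables. The sketch below is only an illustration: it computes suppression costs of pairs and triples of QI rows and verifies $2c(e) \ge$ the sum of the three pair costs for every triple. The helper names and the sample rows are our assumptions; SA values are ignored because the inequality holds for any three rows.

```python
from itertools import combinations

def cost(rows):
    """Suppression cost of a group: (number of non-constant columns) * (number of rows)."""
    m = len(rows[0])
    starred = sum(1 for j in range(m) if len({r[j] for r in rows}) > 1)
    return starred * len(rows)

def simplex_condition_holds(qi_rows):
    """Check 2*c({i,j,k}) >= c({i,j}) + c({i,k}) + c({j,k}) for every triple of rows."""
    for triple in combinations(qi_rows, 3):
        pair_sum = sum(cost(list(pair)) for pair in combinations(triple, 2))
        if pair_sum > 2 * cost(list(triple)):
            return False
    return True

sample_rows = [(0, 1, 2), (0, 1, 3), (1, 1, 2), (0, 2, 2), (1, 2, 3)]
print(simplex_condition_holds(sample_rows))
```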

7 Conclusions 

This paper presents the first theoretical study on the $t$-closeness principle for privacy preserving. We prove the NP-hardness of the $t$-Closeness problem for every constant $t \in [0, 1)$, and give exact and fixed-parameter algorithms for the problem. We also provide a conditionally improved approximation algorithm for $k$-Anonymity, and give the first polynomial time exact algorithm for 2-Diversity.

There are still many related problems that deserve further exploration, amongst which the most interesting one to the authors is designing polynomial time approximation algorithms for $t$-Closeness with provable performance guarantees. We conjecture that the best approximation ratio may be dependent on $n$ (e.g., $O(\log n)$). The parameterized complexity of $t$-Closeness with respect to other sets of parameters is also of interest. Some interesting parameters that have been studied for $k$-anonymity can be found in [11], [6], [7].